Introduction

Part 1 Recap

In our Part 1, we analyzed a hotel booking dataset from the INN Hotels Group to understand the factors influencing booking cancellations.

hotel_data <- read.csv("../../Dataset/INNHotelsGroup_min.csv")
hotel_datay <- read.csv("../../Dataset/INNHotelsGroup.csv")
str(hotel_datay)
## 'data.frame':    36275 obs. of  19 variables:
##  $ Booking_ID                          : chr  "INN00001" "INN00002" "INN00003" "INN00004" ...
##  $ no_of_adults                        : int  2 2 1 2 2 2 2 2 3 2 ...
##  $ no_of_children                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_weekend_nights                : int  1 2 2 0 1 0 1 1 0 0 ...
##  $ no_of_week_nights                   : int  2 3 1 2 1 2 3 3 4 5 ...
##  $ type_of_meal_plan                   : chr  "Meal Plan 1" "Not Selected" "Meal Plan 1" "Meal Plan 1" ...
##  $ required_car_parking_space          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ room_type_reserved                  : chr  "Room_Type 1" "Room_Type 1" "Room_Type 1" "Room_Type 1" ...
##  $ lead_time                           : int  224 5 1 211 48 346 34 83 121 44 ...
##  $ arrival_year                        : int  2017 2018 2018 2018 2018 2018 2017 2018 2018 2018 ...
##  $ arrival_month                       : int  10 11 2 5 4 9 10 12 7 10 ...
##  $ arrival_date                        : int  2 6 28 20 11 13 15 26 6 18 ...
##  $ market_segment_type                 : chr  "Offline" "Online" "Online" "Online" ...
##  $ repeated_guest                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_previous_cancellations        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ no_of_previous_bookings_not_canceled: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ avg_price_per_room                  : num  65 106.7 60 100 94.5 ...
##  $ no_of_special_requests              : int  0 1 0 0 0 1 1 1 1 3 ...
##  $ booking_status                      : chr  "Not_Canceled" "Not_Canceled" "Canceled" "Canceled" ...

The dataset contains hotel bookings collected from 2017-2018 and is characterized by a total of 19 columns, comprising 5 categorical and 14 numerical variables.

Exploratory Data Analysis (EDA)

1. Summary Statistics

summary(hotel_data)
##   no_of_adults  no_of_children  no_of_weekend_nights no_of_week_nights
##  Min.   :0.00   Min.   : 0.00   Min.   :0.00         Min.   : 0.0     
##  1st Qu.:2.00   1st Qu.: 0.00   1st Qu.:0.00         1st Qu.: 1.0     
##  Median :2.00   Median : 0.00   Median :1.00         Median : 2.0     
##  Mean   :1.84   Mean   : 0.11   Mean   :0.81         Mean   : 2.2     
##  3rd Qu.:2.00   3rd Qu.: 0.00   3rd Qu.:2.00         3rd Qu.: 3.0     
##  Max.   :4.00   Max.   :10.00   Max.   :7.00         Max.   :17.0     
##  type_of_meal_plan  required_car_parking_space room_type_reserved   lead_time  
##  Length:36275       Min.   :0.000              Length:36275       Min.   :  0  
##  Class :character   1st Qu.:0.000              Class :character   1st Qu.: 17  
##  Mode  :character   Median :0.000              Mode  :character   Median : 57  
##                     Mean   :0.031                                 Mean   : 85  
##                     3rd Qu.:0.000                                 3rd Qu.:126  
##                     Max.   :1.000                                 Max.   :443  
##   arrival_year  arrival_month   market_segment_type repeated_guest 
##  Min.   :2017   Min.   : 1.00   Length:36275        Min.   :0.000  
##  1st Qu.:2018   1st Qu.: 5.00   Class :character    1st Qu.:0.000  
##  Median :2018   Median : 8.00   Mode  :character    Median :0.000  
##  Mean   :2018   Mean   : 7.42                       Mean   :0.026  
##  3rd Qu.:2018   3rd Qu.:10.00                       3rd Qu.:0.000  
##  Max.   :2018   Max.   :12.00                       Max.   :1.000  
##  no_of_previous_cancellations no_of_special_requests booking_status    
##  Min.   : 0.00                Min.   :0.00           Length:36275      
##  1st Qu.: 0.00                1st Qu.:0.00           Class :character  
##  Median : 0.00                Median :0.00           Mode  :character  
##  Mean   : 0.02                Mean   :0.62                             
##  3rd Qu.: 0.00                3rd Qu.:1.00                             
##  Max.   :13.00                Max.   :5.00                             
##  avg_price_per_room
##  Min.   :  0       
##  1st Qu.: 80       
##  Median : 99       
##  Mean   :103       
##  3rd Qu.:120       
##  Max.   :540

Summary:
The average price per room in the dataset is 103 euros, with a median of 99 euros, but prices can reach up to 540 euros, indicating high-priced outliers. Some entries even show an average price of zero, possibly reflecting promotional deals. Guests typically stay for two weekday nights and one weekend night, with the average number of weekday nights being 2.2 and weekend nights 0.81. Stays can extend to as many as 17 weekday nights. Most bookings involve two adults, and many guests do not bring children. Lead times vary significantly, with an average of 85 days, a median of 57, and some bookings made up to 443 days in advance, suggesting a right-skewed distribution. The data also shows sparse previous cancellations, with an average of just 0.02, and a maximum of 58. Bookings are spread over 2017 and 2018, peaking in August. Additionally, most guests do not make special requests, as the median is zero, although some make up to five requests per booking.

2. Distribution of Booking Status

print(ggplot(hotel_data, aes(x = booking_status)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Booking Status Distribution", x = "Booking Status", y = "Count"))

Summary
The dataset contains 36,275 bookings. Of the total bookings, 24,390 bookings were not canceled (67.2%), while 11,885 bookings were canceled (32.8%), reflecting a significant cancellation rate that offers rich insights into factors influencing booking decisions.

3. Barplot: Number of Special Requests

print(ggplot(hotel_data, aes(x = factor(no_of_special_requests))) +
  geom_bar(fill = "lightblue") +
  labs(title = "Number of Special Requests", x = "Number of Special Requests", y = "Count") +
  theme_minimal())

Summary
Most guests make no special requests, with a sharp decline as the number increases. A significant portion makes one request, while two or more requests are increasingly rare.

4. Plot: Market Segment vs Booking Status

contingency_table <- table(hotel_data$market_segment_type, hotel_data$booking_status)

plot_data <- as.data.frame(contingency_table)
colnames(plot_data) <- c("Market_Segment", "Booking_Status", "Count")

print(ggplot(plot_data, aes(x = Market_Segment, y = Count, fill = Booking_Status)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Booking Status by Market Segment",
       x = "Market Segment",
       y = "Count",
       fill = "Booking Status") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)))

Summary
Online market segment have the highest cancellation count, with a large proportion of canceled bookings of 8475. Offline segment has a cancellation count of 3253 For other market segment, the cancellation rates are too less.

5. Histogram: Lead_time

ggplot(hotel_data, aes(x = lead_time, fill = booking_status)) +
  geom_histogram(binwidth = 10, position = "dodge") +
  labs(
    title = "Lead Time vs Booking Status",
    x = "Lead Time (Days)",
    y = "Number of Bookings",
    fill = "Booking Status"
  ) +
  theme_minimal()

Summary
Here we can see that the booking with short lead times are less cancelled and as the lead time increases there are more booking cancellations. Also, we can see more cancellation happening between lead time form 100-200.

6. Boxplot: avg_price_per_room

ggplot(hotel_data, aes(x = booking_status, y = avg_price_per_room, fill = booking_status)) +
  geom_boxplot() +
  labs(
    title = "Average Room Price vs Booking Status",
    x = "Booking Status",
    y = "Average Room Price"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Summary
The booking with cancelled status has higher median average room price value compared to non-cancelled booking status.This might mean that the booking with her average room price have more chances to get cancelled.

Key Statistical Findings

1. Special Requests and Cancellations:
Bookings with special requests had a significantly lower cancellation rate (20.2%) compared to those without (43.2%).  Chi-square test revealed a statistically significant association between special requests and booking status.

2. Previous Cancellation History:
Surprisingly, guests with previous cancellations had a much lower current cancellation rate (4.73%) compared to those without (33.03%).
This suggests that past cancellation history might not be a straightforward predictor of future booking behavior.

3. Factors Influencing Cancellations:
3.1 Lead Time:

  1. Average lead time for canceled bookings: 139.2 days
  2. Average lead time for non-canceled bookings: 58.9 days
  3. Longer lead times strongly correlated with higher cancellation probability.

3.2 Room Price:

  1. Canceled bookings had a higher average room price.
  2. Price difference: Approximately 10.7 units higher for canceled bookings.

4. Seasonal Variations:
The correlation between factors and cancellations varied across seasons:

4.1 Fall:
Strongest lead time correlation (0.54).

4.2 Summer:
Special requests most negatively correlated with cancellations (-0.30).

4.3 Spring:
Unique pattern with special requests having a strong negative correlation (-0.36).

4.4 Winter:
Slightly different dynamics with weaker correlations.

hotel_data$booking_status_binary <- ifelse(hotel_data$booking_status == "Canceled", 1, 0)

remove_zero_variance <- function(df) {
  df[, sapply(df, function(col) sd(col, na.rm = TRUE) != 0)]
}

data_filtered <- remove_zero_variance(select_if(hotel_data, is.numeric))

cor_data <- cor(data_filtered, use = "complete.obs")

corrplot(cor_data, method = "color", addCoef.col = "black", 
         title = "Correlation Matrix Recap", number.cex = 1, 
         tl.cex = 0.8, mar = c(1, 1, 2, 1))

Lead time strongly predicts cancellations, with longer lead times increasing likelihood (0.44). More special requests (-0.25) and repeated guest status (-0.11) reduce cancellations, reflecting customer commitment and loyalty. Price per room has a weak positive correlation (0.14) with cancellations. Factors like travel party size and stay length show minimal impact on cancellation likelihood.

Preliminary Conclusions:

1. Lead time emerged as the most critical factor in predicting booking cancellations.
2. Special requests significantly reduce the likelihood of cancellations.
3. Booking behavior varies considerably across different seasons.
4. Higher-priced rooms show a slight tendency towards more cancellations.

Part 2 Agenda

In this phase, we aim to build a predictive model for hotel booking cancellations and explore key factors driving cancellation behavior. The analysis will address the following critical questions:

Can we predict the likelihood of booking cancellation based on the lead time?
Can we predict peak booking times and high-demand periods?
What is the relationship between room price and the likelihood of a booking being canceled?
How do seasonal trends impact cancellation rates and booking patterns?
Which factors most strongly influence booking cancellation decisions?

To achieve these objectives, we will analyze customer attributes such as room price, lead time, and seasonal trends. Using logistic and Random forest models, we aim to predict the likelihood of booking cancellations and identify actionable insights. The results will help hotels effectively manage cancellation risks, optimize revenue strategies, and better anticipate high-demand periods.

SMART Questions:

1. Can we predict the likelihood of booking cancellation based on the lead time?

Given that lead time demonstrated the strongest correlation with booking status, we initially selected lead time as our primary predictor variable and booking status as our target variable to begin our analysis.

set.seed(123)
train_index <- createDataPartition(hotel_data$booking_status_binary, p = 0.7, list = FALSE)
train_data <- hotel_data[train_index, ]
test_data <- hotel_data[-train_index, ]
logistic_model <- glm(
  booking_status_binary ~ lead_time,
  data = train_data, 
  family = binomial()
)

summary(logistic_model)
## 
## Call:
## glm(formula = booking_status_binary ~ lead_time, family = binomial(), 
##     data = train_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.82038    0.02328   -78.2   <2e-16 ***
## lead_time    0.01180    0.00019    62.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 32083  on 25392  degrees of freedom
## Residual deviance: 27091  on 25391  degrees of freedom
## AIC: 27095
## 
## Number of Fisher Scoring iterations: 4

The analysis reveals a significant positive relationship between lead time and booking cancellation probability. As the lead time increases, the log-odds of a booking being canceled also increase, with a coefficient of 0.01180.  This suggests that bookings made further in advance are more likely to be canceled. Both the intercept and lead time coefficient are statistically highly significant (p-value < 0.001), indicating that lead time is a reliable and strong predictor of cancellation risk.

test_predictions <- predict(logistic_model, newdata = test_data, type = "response")
predicted_classes <- ifelse(test_predictions > 0.5, 1, 0)
confusion_matrix <- table(Actual = test_data$booking_status_binary, 
                          Predicted = predicted_classes)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(confusion_matrix)
##       Predicted
## Actual    0    1
##      0 6720  570
##      1 2102 1490
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <- confusion_matrix[2,2] / sum(confusion_matrix[,2])
recall <- confusion_matrix[2,2] / sum(confusion_matrix[2,])
f1_score <- 2 * (precision * recall) / (precision + recall)

cat("\nModel Performance Metrics:\nAccuracy:", round(accuracy, 4), "\nPrecision:", round(precision, 4), "\nRecall:", round(recall, 4), "\nF1 Score:", round(f1_score, 4), "\n")
## 
## Model Performance Metrics:
## Accuracy: 0.754 
## Precision: 0.723 
## Recall: 0.415 
## F1 Score: 0.527

1. Accuracy: 0.754, meaning the model correctly classifies 75.4% of the bookings.
2. Precision: 0.723, indicating that 72.3% of the predictions of canceled bookings are correct.
3. Recall: 0.415, meaning the model correctly identifies 41.5% of the actual canceled bookings.
4. F1-Score: 0.527, which is the harmonic mean of precision and recall, providing a balanced evaluation of the model’s performance.

roc_curve <- roc(test_data$booking_status_binary, test_predictions)
plot(roc_curve, main = "ROC Curve for Booking Cancellation Prediction")

The plot indicates that as the specificity (the ability to correctly identify non-canceled bookings) increases, the sensitivity (the ability to correctly identify canceled bookings) also increases. This suggests that the model has good discriminative power in predicting booking cancellations.

auc_value <- auc(roc_curve)
cat("Area Under the ROC Curve (AUC):", round(auc_value, 4), "\n")
## Area Under the ROC Curve (AUC): 0.75

Area Under the ROC Curve (AUC): 0.75, suggesting the model has good discriminative power in predicting booking cancellations.

hotel_data$booking_status_binary <- as.numeric(as.character(hotel_data$booking_status_binary))

ggplot(hotel_data, aes(x = lead_time, y = booking_status_binary)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE) +
  labs(
    title = "Probability of Booking Cancellation vs Lead Time",
    x = "Lead Time (Days)",
    y = "Probability of Cancellation"
  ) +
  theme_minimal()

The graph shows the relationship between the lead time (in days) and the probability of booking cancellation. As the lead time increases, the probability of booking cancellation rises in a non-linear fashion, with the curve becoming more steep at higher lead times.

2. Can we predict high-demand periods?

We created a high-demand column by taking the average price per room and identifying bookings in the top 25% of prices. In our random forest model, we used features like lead time, total nights, booking month, market segment, and repeated guest status to predict these high-demand bookings.

prepare_booking_data <- function(hotel_data) {
  booking_data <- hotel_data %>%
    mutate(
      high_demand = as.factor(ifelse(
        avg_price_per_room > quantile(avg_price_per_room, 0.75), 
        1, 0
      )),
      booking_month = as.factor(arrival_month),
      is_weekend = no_of_weekend_nights > 0,
      total_nights = no_of_weekend_nights + no_of_week_nights,
      is_repeated_guest = repeated_guest > 0
    )
  
  return(booking_data)
}
prepared_data <- prepare_booking_data(hotel_data)
set.seed(123)
train_index <- createDataPartition(prepared_data$high_demand, p = 0.7, list = FALSE)
train_data <- prepared_data[train_index, ]
test_data <- prepared_data[-train_index, ]
# Random Forest Model
rf_model <- randomForest(
  high_demand ~ lead_time + 
                total_nights + 
                booking_month + 
                market_segment_type + 
                is_repeated_guest,
  data = train_data,
  ntree = 500,
  importance = TRUE
)

evaluate_model <- function(actual, predicted, model_name) {
  conf_matrix <- confusionMatrix(as.factor(predicted), as.factor(actual))
  
  results <- data.frame(
    Model = model_name,
    Accuracy = conf_matrix$overall['Accuracy'],
    Precision = conf_matrix$byClass['Precision'],
    Recall = conf_matrix$byClass['Recall'],
    F1 = conf_matrix$byClass['F1']
  )
  
  return(results)
}

rf_pred <- predict(rf_model, newdata = test_data, type = "prob")[,2]
rf_class <- ifelse(rf_pred > 0.5, 1, 0)
rf_results <- evaluate_model(test_data$high_demand, rf_class, "Random Forest")
print(rf_results)
##                  Model Accuracy Precision Recall    F1
## Accuracy Random Forest     0.81     0.847  0.912 0.878

The model performed impressively, with 81% overall accuracy. It correctly identified 91.2% of actual high-demand bookings (recall) and had 84.7% precision in its high-demand predictions.

rf_importance <- data.frame(
  Feature = rownames(importance(rf_model)),
  Importance = importance(rf_model)[,1]
) %>% arrange(desc(Importance))
print(head(rf_importance, 10))
##                                 Feature Importance
## booking_month             booking_month      100.9
## lead_time                     lead_time       92.9
## market_segment_type market_segment_type       85.6
## total_nights               total_nights       50.4
## is_repeated_guest     is_repeated_guest       12.9
ggplot(head(rf_importance, 10), aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Top 10 Features Predicting High-Demand Periods",
    x = "Features",
    y = "Importance"
  ) +
  theme_minimal()

Feature importance revealed that booking month emerged as the most critical predictor, followed closely by lead time and market segment. This suggests that when a booking is made, its timing and market segment are strong indicators of whether it will be a high-value reservation.

predict_high_demand_periods <- function(model, test_data) {
  pred_probs <- predict(model, newdata = test_data, type = "prob")[,2]
  high_demand_periods <- test_data[pred_probs > 0.7, ]
  
  print("High-Demand Periods Prediction:")
  print(paste("Total High-Demand Periods:", sum(pred_probs > 0.7)))
  print("Sample of High-Demand Periods:")
  print(head(high_demand_periods))
  
  return(high_demand_periods)
}

high_demand_periods <- predict_high_demand_periods(rf_model, test_data)
## [1] "High-Demand Periods Prediction:"
## [1] "Total High-Demand Periods: 1462"
## [1] "Sample of High-Demand Periods:"
##     no_of_adults no_of_children no_of_weekend_nights no_of_week_nights
## 62             1              0                    0                 3
## 110            2              0                    2                 1
## 181            2              0                    0                 3
## 202            2              0                    0                 2
## 223            2              0                    1                 5
## 232            2              1                    1                 3
##     type_of_meal_plan required_car_parking_space room_type_reserved lead_time
## 62        Meal Plan 1                          0        Room_Type 4        19
## 110       Meal Plan 1                          0        Room_Type 1        32
## 181       Meal Plan 1                          0        Room_Type 1        11
## 202       Meal Plan 1                          0        Room_Type 1        80
## 223       Meal Plan 1                          0        Room_Type 1        56
## 232       Meal Plan 1                          0        Room_Type 1        77
##     arrival_year arrival_month market_segment_type repeated_guest
## 62          2018             5              Online              0
## 110         2017             9              Online              0
## 181         2017             9              Online              0
## 202         2017             7              Online              0
## 223         2018             9              Online              0
## 232         2018             5              Online              0
##     no_of_previous_cancellations no_of_special_requests booking_status
## 62                             0                      2   Not_Canceled
## 110                            0                      3   Not_Canceled
## 181                            0                      1   Not_Canceled
## 202                            0                      1       Canceled
## 223                            0                      0   Not_Canceled
## 232                            0                      1   Not_Canceled
##     avg_price_per_room booking_status_binary high_demand booking_month
## 62               120.1                     0           1             5
## 110               94.5                     0           0             9
## 181               80.8                     0           0             9
## 202               76.5                     1           0             7
## 223              119.0                     0           0             9
## 232              135.2                     0           1             5
##     is_weekend total_nights is_repeated_guest
## 62       FALSE            3             FALSE
## 110       TRUE            3             FALSE
## 181      FALSE            3             FALSE
## 202      FALSE            2             FALSE
## 223       TRUE            6             FALSE
## 232       TRUE            4             FALSE

This function use the random forest model to predict probabilities for high-demand bookings in the test data and it identifies high-demand periods by filtering for predictions with a probability greater than 0.7.

3. What is the relationship between room price and the likelihood of a booking being canceled or not?

lrmodel1 <- glm(booking_status_binary ~ avg_price_per_room, data = hotel_data, family = binomial)

summary(lrmodel1)
## 
## Call:
## glm(formula = booking_status_binary ~ avg_price_per_room, family = binomial, 
##     data = hotel_data)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.632519   0.036523   -44.7   <2e-16 ***
## avg_price_per_room  0.008695   0.000326    26.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 45887  on 36274  degrees of freedom
## Residual deviance: 45153  on 36273  degrees of freedom
## AIC: 45157
## 
## Number of Fisher Scoring iterations: 4

Here’s the interpretation of the logistic regression summary:

The positive coefficient for avg_price_per_room (0.0087, p-value < 2e-16) indicates that as room price increases, the likelihood of a booking being canceled also slightly increases. While statistically significant, the effect size is modest, suggesting that room price alone has a small influence on cancellations.

hotel_data$predicted_prob <- predict(lrmodel1, type = "response")

hotel_data$predicted_class <- ifelse(hotel_data$predicted_prob > 0.5, 1, 0)

# Model evaluation
conf_matrix <- table(hotel_data$booking_status_binary, hotel_data$predicted_class)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.67"
roc_curve <- roc(hotel_data$booking_status_binary, hotel_data$predicted_prob)
plot(roc_curve)

auc_value <- auc(roc_curve)
print(paste("AUC:", round(auc_value, 2)))
## [1] "AUC: 0.6"

The AUC score of 0.6 suggests that the model has limited ability to distinguish between canceled and non-canceled bookings based solely on room price.

# Boxplot for average price per room vs booking cancellation status
boxplot(avg_price_per_room ~ booking_status_binary, data = hotel_data, 
        main = "Boxplot of Avg Price per Room vs Booking Status", 
        xlab = "Booking Status", ylab = "Average Price per Room")

Key observations:

Median Price: The median room price for both non-canceled and canceled bookings is quite similar, suggesting that the central tendency of room prices does not vary significantly with booking status.

Distribution Spread: Both groups have a wide spread in room prices, but canceled bookings (status = 1) show slightly more variability in higher price ranges. The interquartile range (IQR) is comparable for both groups, indicating that most bookings (whether canceled or not) occur within a similar price range.

Outliers: There are more outliers in higher room price ranges for canceled bookings, suggesting that cancellations are more frequent among high-price bookings.

The average price per room has a slight association with booking cancellations, particularly at higher price ranges. However, the difference in the overall distribution of prices between canceled and non-canceled bookings is minimal.

5. What key factors most strongly influence booking cancellations?

hotel_data$type_of_meal_plan <- as.factor(hotel_data$type_of_meal_plan)
hotel_data$room_type_reserved <- as.factor(hotel_data$room_type_reserved)
hotel_data$market_segment_type <- as.factor(hotel_data$market_segment_type)
hotel_data$arrival_month <- as.factor(hotel_data$arrival_month)
rf_model1 <-randomForest(booking_status_binary ~ no_of_adults + no_of_children + no_of_weekend_nights + 
                          no_of_week_nights + type_of_meal_plan + required_car_parking_space + 
                          room_type_reserved + lead_time + arrival_year + arrival_month + 
                          market_segment_type + repeated_guest + 
                          no_of_previous_cancellations + avg_price_per_room + no_of_special_requests,
                        data = hotel_data, importance = TRUE, ntree = 100)

feature_importance <- importance(rf_model1)

importance_df <- data.frame(Feature = rownames(feature_importance), Importance = feature_importance[,1])

importance_df <- importance_df[order(-importance_df$Importance),]
importance_df
##                                                   Feature Importance
## lead_time                                       lead_time     214.99
## no_of_special_requests             no_of_special_requests     183.66
## arrival_month                               arrival_month     109.45
## avg_price_per_room                     avg_price_per_room     104.81
## market_segment_type                   market_segment_type      97.86
## no_of_week_nights                       no_of_week_nights      52.54
## no_of_weekend_nights                 no_of_weekend_nights      48.14
## required_car_parking_space     required_car_parking_space      42.51
## no_of_adults                                 no_of_adults      33.90
## type_of_meal_plan                       type_of_meal_plan      32.13
## room_type_reserved                     room_type_reserved      31.48
## arrival_year                                 arrival_year      25.43
## no_of_children                             no_of_children      19.48
## repeated_guest                             repeated_guest      10.73
## no_of_previous_cancellations no_of_previous_cancellations       4.74
ggplot(importance_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +  # Flip the axes to make it easier to read
  labs(title = "Feature Importance from Random Forest Model",
       x = "Feature", y = "Importance") +
  theme_minimal()

The feature importance analysis reveals key factors that influence hotel booking cancellations:

Key Drivers:
Lead Time: The most important factor, with longer lead times increasing the likelihood of cancellations. This suggests that customers are more likely to cancel bookings made far in advance.
Number of Special Requests: Highly influential, indicating that customers with fewer special requests are more likely to cancel, potentially reflecting less commitment to their bookings.

Moderately Influential Factors:
Seasonality and Pricing: Arrival month and average price per room are significant predictors, showing that cancellations are influenced by seasonal trends and pricing strategies.
Market Segment: The type of market segment (e.g., corporate, leisure) also has a considerable impact, reflecting differences in booking behavior across customer groups.

Secondary Factors:
Stay Duration: The number of weekend and weekday nights has a moderate impact, suggesting that longer stays might be more stable.
Customer Demographics: Features like the number of adults and children in a booking contribute to predictions but are less critical.

Least Important Factors:
Repeated Guest Status: Returning guests are less likely to cancel but have limited overall impact.
Previous Cancellations: Surprisingly, the number of past cancellations has minimal influence, suggesting that other factors play a more critical role in predicting cancellations.

# Converting booking status to numeric
hotel_data$booking_status_binary <- as.numeric(hotel_data$booking_status_binary)
hotel_data$market_segment_type_num <- as.numeric(hotel_data$market_segment_type)
hotel_data$arrival_month_num <- as.numeric(hotel_data$arrival_month)

corr_data <- hotel_data[, c("booking_status_binary", "lead_time", "no_of_special_requests", 
                            "avg_price_per_room", "market_segment_type_num", "no_of_week_nights", 
                            "no_of_weekend_nights", "arrival_month_num")]

#correlation matrix
cor_matrix <- cor(corr_data)


# Plotting the correlation matrix using corrplot
corrplot(cor_matrix, method = "circle", type = "lower", tl.col = "black", tl.srt = 45)

Strong Positive Correlation:
Lead Time and Booking Status: Lead time shows a strong positive correlation with booking cancellations, indicating that longer lead times are more likely to result in cancellations. This aligns with its high feature importance.

Strong Negative Correlation:
Number of Special Requests and Booking Status: There is a strong negative correlation, suggesting that bookings with more special requests are less likely to be canceled, likely reflecting higher customer commitment.

Moderate Correlations:
Market Segment and Booking Status: The market segment shows a moderate positive correlation with cancellations, highlighting the role of booking channels in cancellation behavior.
Average Price Per Room and Booking Status: There is a weaker but positive correlation, indicating that higher room prices are slightly more associated with cancellations.

Weak or Negligible Correlations:
Features such as arrival month show weak correlations with booking status, suggesting a limited role of seasonality on its own.

The correlation matrix confirms that lead time and the number of special requests are the strongest predictors of booking cancellations. Moderate relationships with market segment and room price also provide actionable insights. However, other features show weak correlations, implying their limited independent impact on cancellations.

Building booking cancellation predictive model using the key factors

Checking for multicollinearity

lm_model1 <- lm(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room + 
                 market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month,
                 data = hotel_data)

gvif_values <- vif(lm_model1)

print(gvif_values)
##                        GVIF Df GVIF^(1/(2*Df))
## lead_time              1.25  1            1.12
## no_of_special_requests 1.21  1            1.10
## avg_price_per_room     1.44  1            1.20
## market_segment_type    1.70  4            1.07
## no_of_week_nights      1.08  1            1.04
## no_of_weekend_nights   1.06  1            1.03
## arrival_month          1.36 11            1.01

The Variance Inflation Factor (VIF) analysis indicates that multicollinearity is not a concern in the model, as all predictors have GVIF values below the commonly accepted threshold of 5. This suggests that the predictors are not highly correlated with one another, ensuring the stability and reliability of the model’s coefficient estimates.

Building model using Random Forest

We selected Random forest classifier for its ability to handle non-linear relationships and capture feature interactions effectively. Its performance was evaluated using metrics like accuracy, sensitivity, specificity, and balanced accuracy.

hotel_data$booking_status_binary <- as.factor(hotel_data$booking_status_binary)


rf_model2 <- randomForest(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room + 
                          market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month, 
                          data = hotel_data, ntree = 100)

print(rf_model2)
## 
## Call:
##  randomForest(formula = booking_status_binary ~ lead_time + no_of_special_requests +      avg_price_per_room + market_segment_type + no_of_week_nights +      no_of_weekend_nights + arrival_month, data = hotel_data,      ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 10.8%
## Confusion matrix:
##       0    1 class.error
## 0 22868 1522      0.0624
## 1  2386 9499      0.2008

Model Summary
The random forest classification model demonstrated strong performance in predicting hotel booking cancellations. Using key predictors such as lead time, number of special requests, average room price, and market segment type, the model achieved an Out-of-Bag (OOB) error rate of 10.8%, indicating reliable generalization to unseen data. The confusion matrix highlights the model’s ability to accurately predict non-cancellations (class 0) with a low class error rate of 6.24%, while cancellations (class 1) had a slightly higher error rate of 20.08%.

These results underline the effectiveness of random forest in handling non-linear relationships and capturing feature interactions. The insights from the model—such as the importance of lead time and special requests—offer actionable recommendations for managing booking cancellations more effectively. Future work could address the higher misclassification rate for cancellations by exploring class imbalance solutions or incorporating additional features, such as customer demographics or cancellation policies. Overall, the model provides a robust foundation for data-driven decision-making in the hospitality industry.

pred_class_rf <- predict(rf_model2, newdata = hotel_data)

# Confusion Matrix 
conf_matrix_rf <- confusionMatrix(pred_class_rf, hotel_data$booking_status_binary)
conf_matrix_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 23496  1592
##          1   894 10293
##                                         
##                Accuracy : 0.931         
##                  95% CI : (0.929, 0.934)
##     No Information Rate : 0.672         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.842         
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.963         
##             Specificity : 0.866         
##          Pos Pred Value : 0.937         
##          Neg Pred Value : 0.920         
##              Prevalence : 0.672         
##          Detection Rate : 0.648         
##    Detection Prevalence : 0.692         
##       Balanced Accuracy : 0.915         
##                                         
##        'Positive' Class : 0             
## 
conf_matrix_df <- as.data.frame(as.table(conf_matrix_rf))
conf_matrix_df
##   Prediction Reference  Freq
## 1          0         0 23496
## 2          1         0   894
## 3          0         1  1592
## 4          1         1 10293
colnames(conf_matrix_df) <- c("Actual", "Predicted", "Freq")  # Rename columns for clarity

# Plot confusion matrix as heatmap using ggplot2
ggplot(conf_matrix_df, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "black", size = 6) +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Predicted", y = "Actual", title = "Confusion Matrix Heatmap") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The confusion matrix provides detailed insights into the performance of the random forest model for predicting hotel booking cancellations:

True Negatives (TN):
23,496 non-canceled bookings (class 0) were correctly predicted as non-canceled.

False Positives (FP):
894 non-canceled bookings (class 0) were incorrectly predicted as canceled (class 1).

False Negatives (FN):
1,592 canceled bookings (class 1) were incorrectly predicted as non-canceled (class 0).

True Positives (TP):
10,293 canceled bookings (class 1) were correctly predicted as canceled

pred_probs_rf <- predict(rf_model2, newdata = hotel_data, type = "prob")[, 2]

roc_curve_rf <- roc(hotel_data$booking_status_binary, pred_probs_rf)

# Plot the ROC curve
plot(roc_curve_rf, main = "ROC Curve for Random Forest Model")

# Print AUC value
auc_value_rf <- auc(roc_curve_rf)
cat("AUC:", round(auc_value_rf, 4), "\n")
## AUC: 0.986
# Extracting Precision, Recall, and Accuracy from the confusion matrix
precision <- conf_matrix_rf$byClass['Pos Pred Value']
recall <- conf_matrix_rf$byClass['Sensitivity']
accuracy <- conf_matrix_rf$overall['Accuracy']

cat("Precision:", round(precision, 4), "\n")
## Precision: 0.936
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.963
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.931

The random forest model demonstrated excellent predictive performance for hotel booking cancellations. With an overall accuracy of 93.1% and a balanced accuracy of 91.5%, the model reliably differentiates between canceled and non-canceled bookings. The Kappa statistic of 0.842 further indicates strong agreement between predictions and actual outcomes, accounting for chance.

Key performance metrics include:

Sensitivity (Non-cancellations): 96.3%, showing the model’s high ability to correctly identify non-canceled bookings.
Specificity (Cancellations): 86.6%, reflecting good performance in identifying cancellations.
Positive Predictive Value (Precision): 93.7%, indicating confidence in predicting non-cancellations.
Negative Predictive Value: 92%, showing reliability in predicting cancellations.
AUC: 0.986 means the model can correctly differentiate between canceled (class 1) and non-canceled (class 0) bookings in 98.6% of cases.

Validating the model by cross-validation with Random Forest using 5-fold cross-validation

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

ntree_val <- 50

# cross-validation with Random Forest using 5-fold cross-validation
cv_model <- train(booking_status_binary ~ lead_time + no_of_special_requests + avg_price_per_room + 
                  market_segment_type + no_of_week_nights + no_of_weekend_nights + arrival_month, 
                  data = hotel_data,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid = data.frame(mtry = 3),
                  ntree = ntree_val)


print(cv_model)
## Random Forest 
## 
## 36275 samples
##     7 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 29020, 29020, 29020, 29020, 29020 
## Resampling results:
## 
##   Accuracy  Kappa
##   0.865     0.679
## 
## Tuning parameter 'mtry' was held constant at a value of 3
stopCluster(cl)

Cross-validation was performed to ensure the reliability and robustness of the random forest model. It helps to evaluate the model’s performance across multiple subsets of the data, reducing the risk of overfitting and providing a more generalizable estimate of accuracy. Specifically, 5-fold cross-validation divides the dataset into five equally sized folds, training the model on four folds and testing it on the remaining fold, iteratively. This ensures that every data point is used for both training and testing, providing a reliable estimate of model performance.

Cross-validation was conducted to validate the model’s predictive performance and reduce overfitting. The accuracy (86.8%) and Kappa (0.686) indicate that the random forest model is robust and provides reliable predictions for booking cancellations and non-cancellations. This approach ensures that the model will generalize well to unseen data.

Conclusion

In this analysis, we successfully developed and evaluated a random forest model to predict hotel booking cancellations based on key features such as lead time, number of special requests, average room price, and market segment type. Through rigorous methods like 5-fold cross-validation, we demonstrated that the model achieves a high accuracy of 86.8%, with a Kappa value of 0.686, indicating substantial agreement between predictions and actual outcomes. The model’s robustness was further validated through performance metrics, including an out-of-bag (OOB) error rate of 10.9%.

Key findings highlight that lead time and the number of special requests are the most influential predictors, providing actionable insights for hotels to optimize revenue management strategies. While the model performs well overall, higher misclassification rates for canceled bookings suggest opportunities for improvement by incorporating additional predictors or addressing class imbalance.

This analysis emphasizes the importance of data-driven decision-making in the hospitality industry, offering valuable tools to anticipate cancellations and enhance operational efficiency. Future work could focus on integrating additional features, such as customer demographics or external factors (e.g., economic trends), to further improve prediction accuracy and model interpretability.

References

  1. Antonio, N., de Almeida, A., & Nunes, L. (2017). Predicting hotel bookings cancellation with a machine learning classification model. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 1100–1107). IEEE. https://ieeexplore.ieee.org/document/8260781

  2. Antonio, N., de Almeida, A., & Nunes, L. (2017). Predicting hotel booking cancellations to decrease uncertainty and increase revenue.Tourism & Management Studies, 13(2), 25–39. https://www.researchgate.net/publication/310504011_Predicting_Hotel_Booking_Cancellation_to_Decrease_Uncertainty_and_Increase_Revenue

  3. Abeyrathne, C., & Bandara, H. M. N. D. (2023). Hotel booking cancellation prediction system using machine learning. Faculty of Engineering, University of Ruhuna. https://www.researchgate.net/publication/380515766_Hotel_Booking_Cancellation_Prediction_System_using_Machine_Learning